Genome-wide assessment of population structure and association mapping for agronomic and grain nutritional traits in proso millet (Panicum miliaceum L.)

Proso millet is an important but under-researched and underutilized crop with the potential to become a future smart crop because of its climate-resilient features and high nutrient content. Assessing diversity and marker-trait associations are essential to support the genomics-assisted improvement of proso millet. This study aimed to assess the population structure and diversity of a proso millet diversity panel and identify marker-trait associations for agronomic and grain nutrient traits. In this study, genome-wide single nucleotide polymorphisms (SNPs) were identified by mapping raw genotyping-by-sequencing (GBS) data onto the proso millet genome, resulting in 5621 quality-filtered SNPs in 160 diverse accessions. The modified Roger's Distance assessment indicated an average distance of 0.268 among accessions, with the race miliaceum exhibiting the highest diversity and ovatum the lowest. Proso millet germplasm diversity was structured according to geographic centers of origin and domestication. Genome-wide association mapping identified 40 marker-trait associations (MTAs), including 34 MTAs for agronomic traits and 6 for grain nutrients; 20 of these MTAs were located within genes. Favourable alleles and phenotypic values were estimated for all MTAs. This study provides valuable insights into the population structure and diversity of proso millet, identified marker-trait associations, and reported favourable alleles and their phenotypic values for supporting genomics-assisted improvement efforts in proso millet.


SNP diversity and population structure
For the final dataset, after filtering, we retained 160 accessions with 5621 SNPs.The SNP counts varied from 204 on chromosome 10 to 476 on chromosome 3 (Table 1).The SNP distribution across the 18 chromosomes is shown in Fig. 2. Analysis of the position and distribution of each SNP locus at the whole-genome level showed that 68% of the SNPs were within 100 kbp of adjacent SNPs.Further, using SnpEff 27 , each SNP locus was annotated based on its genomic location to predict coding effects.It was found that 4.5% of the SNP loci were in the exon regions and 15.6% in intergenic regions, while 43.7% and 36.18% of SNPs were in the downstream and upstream regions of the gene, respectively.

AMOVA (analysis of molecular variance)
The AMOVA indicated the highest contribution of variation within a race (71%) and within the region (66.9%).However, a low but significant contribution was observed between races and regions (4% and 8.4%, respectively) (Table 2).This implies that traditional race classifications based on morphology are weakly correlated with the underlying population genetics.

Genetic distance
The average Modified Roger's Distance (MRD) of the entire set was 0.268 and ranged from 0.126 to 0.341.Among the races, accessions belonging to the miliaceum race had the highest average distance (0.274), whereas the lowest distance was observed among accessions within the ovatum race (0.201) (Supplementary Table 1).Figure 3 shows the minimum, maximum and median MRD of the entire set within and between the races.Among races, the lowest distance was observed between ovatum and contractum (0.248), followed by ovatum and compactum (0.250), whereas race patentissimum showed a higher distance from other races (0.260-0.272) (Supplementary Table 1).Principal component analysis and ADMIXTURE were used to infer the population structure of the collection, and clear subpopulation structures were observed (Fig. 4a).The first two principal components explained 28% of the total genetic variation and aided in visually differentiating the substructures within Asian accessions.PCA also showed the presence of a substructure within the collection, based on the regions and countries within the regions.A grouping of a small clump of 14 Asian accessions in the top center of the PCA biplot showed that these accessions were diverse from the other Asian accessions; on further observation, 12 of these 14 accessions were of Korean origin, while the remaining two accessions were from China and Germany.In addition, from the    www.nature.com/scientificreports/biplot, it can be seen that a diverse set of 20 Asian accessions clustered at the bottom-right.Further observation of the countries of sample collection found that most of the accessions were of Indian origin (14 accessions from India) and one accession from Sri Lanka, while other accessions were from Mexico, Syria, and other countries.The hierarchical population structure, using the model-based ADMIXTURE program, was run assuming K = 1 to 10 populations without providing any prior information on population structure.The obtained CV values and the corresponding ΔCV, combined with a line graph using CV errors for each K, showed that the CV error decreased steadily up to K = 5 and increased afterwards.This suggested the presence of five natural subpopulations (K = 5) (Fig. 4b) within our proso millet collection, and a K value of 5 was considered an appropriate population structure.The five populations were named POP1-POP5 (Fig. 4c).The accessions that had population membership of < 0.6 in all five populations were considered admixtures (accessions with genomes of two or more populations).
We assigned individuals to any of the five subpopulations considering the maximum proportion of membership.Accordingly, POP1, POP2, POP3, and POP5 represent most accessions from Asia, whereas POP4 represents most accessions from Europe (18 out of 25 accessions, including two unknown origins, one from America, and four from Asia).Accessions from Korea were grouped as POP2.Accessions of the race miliaceum dominated in all the populations, while the majority of accessions belonging to compactum were in the POP1, ovatum in the POP3, patentissimum in the POP 5, contractum in the POP 1 and POP 3.These distributions show that the proso millet accessions were not structured as per racial groups, while they were structured according to regions and countries within regions (for example, Korea in POP 2, Russia in POP 4), as observed in PCA.Overall, 93 of the 160 accessions had a population membership of > 0.90, while 31 accessions had a population membership of < 0.60.Approximately 17% of accessions (12 accessions) from Asia were admixture (< 0.60), while 37% of accessions from Europe had admixtures with different populations.Hierarchical clustering based on the calculated MRD showed the presence of five major clusters (Fig. 5).The cluster dendrogram results also agreed with the admixture-based population structure in terms of both the number of populations and the presence of population structure based on regions.Cluster-5 was dominated by Korean accessions and cluster-4 was dominated by Indian accessions.Combining the cluster dendrogram and ADMIXTURE-based population membership, individuals belonging to clusters 1, 4, and 5 had well-defined allelic or membership proportions with fewer admixtures.

Genome-wide association study (GWAS) on agronomic and nutrient traits
GWAS for agronomic traits identified 121, 95, and 95 marker-trait associations (MTAs) for 2015, 2016, and combined datasets, respectively, using FarmCPU with a p-value cutoff of ≤ 0.0001.Furthermore, when we looked for common MTAs across the three datasets (2015, 2016, and combined), 34 SNPs were found to be significantly associated with agronomic traits in at least two of them.Among these 34 MTAs, four SNPs for inflorescence length (Proso.1_8346815,Proso.14_27820106,Proso.12_34047515, and Proso.12_41890075)(Fig. 7) and one for plant height (Proso.7_1535098)were detected significant across all three datasets.Eighteen of the 34 MTAs were located in genes (Table 3).GWAS on grain nutrient traits identified 24, 37, and 26 MTAs for the 2015, 2016 and combined datasets, but only six were found in at least two of the datasets.Of these, two SNPs (Proso.17_30948407and Proso.17_5885921,associated with Zn and Fe, respectively) on chromosome 17 were located in genes PM17G09880 and TE311547.

Comparative genomics
Twenty MTAs that were significantly associated with various traits were located within the genes.Although these specific SNPs are probably not causal polymorphisms for these traits, the rapid LD decay in this population (Fig. 6) implies that the genes have a strong probability of being involved.The sequence information of these genes was retrieved from www. genom eevol ution.com and compared with related species to check the similarity and gene function.More than 90% similarity was considered to report the genes and their functions in related species (Supplementary Table 2).For example, the SNP Proso.2_14901071associated with basal tiller number is located on the gene PM02G15460, showing over 90% similarity with genes in the three species, Panicum hallii, Panicum virgatum, and Setaria italica, with gene functions of "putative leucine-rich repeat-containing protein, sporulation-specific protein 15-like, and girdin-like".The SNP Proso.17_3253916associated with inflorescene length is located on the gene PM17G03120 showed over 90% similarity with genes in the closely related species namely Panicum hallii, Panicum virgatum, Sateria italica, with gene function of "filament-like plant protein".The SNP Proso.14_27820106, which is associated with inflorescene length, is located in the gene PM14G15020 and showed over 90% similarity with Panicum hallii, Panicum virgatum, Sateria viridis, Setaria italica, and Zea mays genes with the gene function of "dihydrolipoyllysine-residue acetyltransferase component 4 of the pyruvate dehydrogenase complex, chloroplastic-like" (Supplementary Table 2).

DISCUSSION
The proso millet diversity panel used in this study had an average MRD of 0.268, which varied from 0.126 to 0.341.Among the five races of proso millet, which are primarily classified on the basis of panicle morphology and shape 14 , accessions belonging to milliaceum had the highest average distance (0.274), whereas accessions of ovatum showed the lowest average distance (0.201).The lowest between-race distance was found between the ovatum with compactum (0.248) and the ovatum with contractum (0.250).Similar results were found when the same panel and the entire proso millet collection were assessed for phenotypic diversity 13,26 .The three races, namely contractum, compactum, and ovatum, look similar, except for panicle morphology: compact and drooping inflorescence in contractum, cylindrical and erect inflorescence in compactum, and compact and slightly curved inflorescence in ovatum 13,14 .These three races phenotypically differ from the other two races, milliaceum and patentissimum, which are often difficult to distinguish.Accessions belonging to the race miliaceum are characterized by a large open inflorescence with suberect branches that are sparingly subdivided, whereas those belonging to the patentissimum are characterized by slender and diffused panicle branches.Understanding the diversity and population structure of germplasm resources is important for their use in crop improvement programs.Population structure analysis revealed the presence of five populations in the proso millet diversity panel, which did not correspond with these race designations.Instead, the populations corresponded well with geography.Four of the populations consisted almost entirely of Asian accessions, indicating greater genetic diversity, whereas almost all the European accessions clustered into a single population.These results support that Asia is the centre of origin and diversity of proso millet, followed by a spread westward across Europe 3,13,26 .In our previous study on the same subset, diversity and population structure were estimated using morpho-agronomic data, indicating that the accessions of proso millet were structured largely according to geographical region.Accessions originating in Asia and Europe were distinctly grouped, also accessions from Fig. 6.Genome-wide linkage disequilibrium (LD) decay in proso millet using GBS-based SNPs.Black dots represent individual SNP pairs on the same chromosome, while the red line shows the mean value.
Asia showed high diversity (average distance 0.268) relative to those from Europe (average distance 0.225), and high diversity was observed between accessions of Asia and Europe (average distance 0.301) 26 .
Advances in NGS technologies and the availability of a draft genome for proso millet can accelerate genomicsassisted crop improvement 16,17 .In proso millet, Rajput et al. 28 reported QTLs for morpho-agronomic traits using bi-parental mapping, while there are only two reports available on GWAS for agronomic and seed traits 24,25 , and no report on grain nutrients.In this study, a diversity set of proso millets representing five races originating from 30 countries was genotyped using the GBS approach.After filtering, 160 accessions originating from 26 countries and 5,621 quality SNPs were used to perform GWAS on agronomic and grain nutrient traits.A total of 40 MTAs were identified: 34 for agronomic traits and six for grain nutrient traits.Nine MTAs (two for flag leaf blade length, five for inflorescence length, and one each for plant height and paniel exsertion) were identified as linked with the phenotypic trait of variation in both years, and five of them showed significant associations in both the years as well as when the years were combined.Long inflorescences and tall plants are among the important traits that are positively associated with higher grain yield in proso millet 26 .Of the seven SNPs that were associated with inflorescence length, four were associated in both years as well as in the combined data.Among these, four SNPs showed a positive effect (Proso.1_8346815,Proso.12_41890075,Proso.14_27820106, and Proso.17_3253916),whereas Proso.12_34047515 and Proso.9_39130372 had a negative effect on inflorescence length.The four SNPs, Proso.17_3253916,Proso.1_8346815,Proso.14_27820106, and Proso.12_34047515, are located on genes PM17G03120, PM01G10560, PM14G15020, and PM12G21200, respectively, indicating potential candidate genes for yield improvement in proso millet.For plant height, two SNPs were identified, located on chromosomes 7 and 9.The SNP Proso.7_1535098showed significant association in both years as well as in the  www.nature.com/scientificreports/combined data, and is located in the gene TE347748 with gene function of "putative pentatricopeptide repeatcontaining protein" and "ninja-family protein 8-like" (Supplementary Table 2).Six SNPs were identified for grain nutrient content, of which two were located in the genes.For all the MTAs identified in this study, a box plot showing alleles and their phenotypic values was estimated, which is important for further use of these SNPs in the genomics-assisted improvement of traits (Fig. 8, Supplementary Fig. 2).Sequence similarity of the identified genes was compared with related species, and gene functions were reported, which will help in understanding the genetic basis of phenotypic variation of different traits.
In conclusion, proso millet is a potential crop for food security and nutrition and has various climate resilience and nutritional benefits.However, systematic breeding and genomics-assisted improvements in proso millet are very limited.In this study, the NGS-based genotypic characterization of proso millet revealed a wider diversity within and among races, and proso millet germplasm diversity was structured according to the two geographical regions where proso millet was reported to originate and be domesticated.Genome-wide association mapping identified 40 marker-trait associations for agronomic traits (34) and grain nutrients (6), most of which were located within the genes.The information generated from this study on diversity and marker-trait associations can support the development of allele-specific markers for mining productivity and nutrient traits, and their utilization in genomic-assisted proso millet improvement.

Experimental materials and conditions
The experimental material consisted of 200 accessions which includes the core collection (106 accessions) 29 .Number of accessions in the core collection is less for GWAS therefore we followed same approach by which core collection was established to make a diversity subset of 200 accessions from the entire collection of 849 accessions conserved in the genebank (http:// geneb ank.icris at.org/).These accessions were planted during the 2015 and 2016 rainy seasons at ICRISAT (Patancheru, Telangana, India; 17° 30′ N latitude, 78° 15′ E longitude and altitude 545 MSL) in an alpha design with two replications on red soil, planted in the third week of July during both years.Sowings were performed on ridges 60 cm apart, and each accession occupied a single row of 4 m in length.Plant-to-plant spacing of approximately 10 cm was maintained by thinning the excess seedlings.Diammonium phosphate was applied at a rate of 100 kg/ha as a basal dose to supply nitrogen and phosphorus.In addition, 100 kg/ha of urea was applied as top dressing.Irrigation and hand-weeding were performed on a need-based basis.

Data collection
Data on 14 agronomic traits were recorded using the descriptors of Panicum miliaceum 30 .The agronomic traits of days to 50% flowering, days to maturity, and grain yields were recorded on a plot basis, while the other agronomic traits (plant height, basal tillers, flag leaf blade length, flag leaf blade width, flag leaf sheath length, peduncle length, panicle extension, Inflorescence length, number of nodes, and inflorescence primary branch number) were recorded on the main culms of the five representative plants in a plot.Bulked seeds of each accession were used to determine the 100-seed weight.The grain yield per plot was converted into grain yield (kg/ha).A random, well-cleaned grain sample (unhusked) from each accession was used to estimate the grain protein, calcium (Ca), iron (Fe), and zinc (Zn) content at the Charles Renard Analytical Laboratory, ICRISAT, Patancheru, India.Grain Ca, Fe, and Zn contents were assessed following the nitric acid-hydrogen peroxide digestion method, and Ca, Fe, and Zn in the digests were analyzed using inductively coupled plasma-optical emission spectrometry (ICP-OES) 31 .Protein content in grain samples was determined using the sulfuric acid-selenium digestion method.Total nitrogen (N) was estimated using a Skalar Autoanalyzer, and protein % was calculated as N% × 6.25 conversion factor 32 .

Phenotypic data analysis
Data were analyzed for each rainy season separately and pooled following Residual Maximum Likelihood (REML) 33 in GenStat, 17th edition (http:// www.genst at.co.uk) considering genotypes as random and seasons as fixed effects.The significance of seasons was tested using Wald's statistics 34 .The Best Linear Unbiased Predictors (BLUPs) were obtained for all the traits for each accession for individual seasons, pooled over two rainy seasons, and used for genome-wide association studies.

Genotyping and SNP calling
DNA extraction and SNP calling from genotyping-by-sequencing (GBS) data have been described in detail in our previous publication 35 .In brief, DNA was extracted from each accession following the modified CTAB method 36 , lyophilized, and shipped to the Genomic Diversity Facility at Cornell University for GBS 37 .GBS library preparation followed the standard method 38 using a single PstI restriction enzyme.Samples were multiplexed into two lanes of 95 samples plus one blank for sequencing on an Illumina HiSeq 2500 with single-end 100 bp sequencing.The sequences were mapped to proso millet reference genome Panicum miliaceum (vPm_0390_v1) (https:// andge nomev oluti on.org/ coge/ Searc hResu lts.pl?s= 52484 &p= genome) 17 using Bowtie v2.2.4 39 .SNPs were called using the GBS v2 pipeline in TASSEL v4.3.6.Raw SNPs were filtered by removing any sites with greater than 20% missing, less than 0.1 proportion heterozygous, and a minor allele frequency of < 0.025.Accessions with more than 20% of their missing sites were filtered out.This resulted in 5621 high-confidence SNPs that were used for GWAS.AMOVA was computed to determine the presence of significant variation in the collection and assess the contribution of different stratifications to diversity.Principal Component Analysis was used to summarize and obtain preliminary knowledge about the diversity within the collection.The hierarchical population structure was estimated using the ADMIXTURE program, which is a model-based estimation of ancestry in unrelated individuals using the maximum-likelihood method 40 .ADMIXTURE implements a cross-validation (CV) feature that allows, together with the number of iterations to converge, the determination of the number of subpopulations (k values) that best fit the data.After choosing the subpopulation level, individual accessions were assigned to the subpopulation if they had at least 60% membership in that respective population 41 .We calculated the Modified Roger's Distances (MRD) between samples 42,43 as, c) where p ij and q ij are the allele frequencies of jth and ith markers in the two samples under consideration, a i is the number of alleles in the ith marker and m refers to the number of markers.Clustering of accessions based on the MRD distances was performed using Ward's D2 hierarchical clustering algorithm 44 .

Fig. 1 .
Fig. 1.Frequency distribution of key agronomic and grain nutrient traits of proso millet evaluated in 2015 and 2016 at ICRISAT Patancheru, India.

Fig. 4 .
Fig. 4. Population structure assessment of proso millet collection: (a) PCA biplot of 160 accessions based on SNPs from GBS, (b) rate of change in cross-validation (CV) error between successive k-values (k values ranging from 1 to 10), and (c) model-based population structure in proso millet collection based on ADMIXTURE with K = 5 populations for the 160 accessions.

Fig. 5 .
Fig. 5. Cluster dendrogram of GBS-based SNPs, based on Modified Roger's Distances (innermost colors on the dendrogram represent clusters, shapes at the nodes of the dendrogram represent races, tiles surrounding the dendrogram represent the region, and colored outermost bars represent the ADMIXTURE proportions-based population structure.

Fig. 7 .Table 3 .
Fig. 7. Manhattan plots and quantile-quantile (QQ) plots for inflorescence length (INFL) of proso millet evaluated in 2015 and combined for both years.The chromosomes are marked with different colors along the horizontal axis.

Fig. 8 .
Fig. 8.The MTAs with alleles and trait values, for use in genomic assisted improvement of proso millet (Note: SNP name starts with the code DF = Days to 50% flowering; PLHT = Plant height, cm; INFL = Inflorescence length, mm; Protein %).

Table 2 .
Analysis of molecular variance (AMOVA) and Monte Carlo significance tests for 160 accessions.